feat: Helm umbrella chart for observability-stack#2
Closed
kylehounslow wants to merge 65 commits into
Adds charts/observability-stack/ as an umbrella Helm chart using upstream dependencies:
- opensearch 3.5.0
- opensearch-dashboards 3.5.0
- data-prepper 0.3.1
- opentelemetry-collector 0.147.0
- prometheus 28.13.0

values.yaml mirrors the existing docker-compose configuration. All 5 dependencies resolve and helm template renders successfully.

Signed-off-by: Kyle Hounslow <kylehounslow@gmail.com>
- Add OPENSEARCH_INITIAL_ADMIN_PASSWORD env var for OpenSearch 3.5+
- Override Data Prepper image to latest (chart default 2.8.0 lacks otlp source)
- Explicitly disable SSL on Data Prepper server and peer_forwarder
- Fix service name references for inter-component connectivity
- Use otel/opentelemetry-collector-contrib image (required by chart)

Validated: all 5 pods running on kind + finch (macOS arm64)

Signed-off-by: Kyle Hounslow <kylehounslow@gmail.com>
OSD was failing to authenticate to OpenSearch (401 No Authorization header). Added opensearch.username and opensearch.password to opensearch_dashboards.yml.

End-to-end pipeline validated:
- curl → OTel Collector → Data Prepper → OpenSearch ✅
- OSD UI accessible on port-forward ✅

TODO: centralize credentials via .env-style values (not hardcoded)

Signed-off-by: Kyle Hounslow <kylehounslow@gmail.com>
Helm post-install/post-upgrade hook that runs the existing init-opensearch-dashboards.py script as a K8s Job.

Creates: workspace, index patterns (logs/traces/service-map), trace-to-logs correlation, APM config, agent observability dashboard, overview dashboard, and saved queries.

Script patched to read BASE_URL from env var for K8s service names.

Validated: job completes in ~30s, all saved objects created.

Signed-off-by: Kyle Hounslow <kylehounslow@gmail.com>
- Image: sgguruda62324/opensearch-data-prepper:2.15.0-SNAPSHOT (matches docker-compose .env, includes ps48's prometheus auth PR #6595)
- Correct experimental plugin syntax for DP 2.15
- Re-added prometheus remote-write sink to service-map pipeline
- All pipelines initialized including RED metrics to Prometheus

Signed-off-by: Kyle Hounslow <kylehounslow@gmail.com>
- Add opensearch-credentials Secret template
- Init job references secret via secretKeyRef instead of hardcoded values
- Document that DP pipeline configs still need manual password sync (subchart values don't support Go templating)

Signed-off-by: Kyle Hounslow <kylehounslow@gmail.com>
CronJob runs every 2 minutes, sends 5 agent traces per run with realistic GenAI semantic convention attributes:
- invoke_agent spans with gen_ai.agent.name
- chat spans with gen_ai.request.model, token usage, provider
- execute_tool spans with gen_ai.tool.name
- Randomized models (gpt-4o, claude-sonnet-4-20250514, nova-pro)

Validated: 20 spans indexed in OpenSearch from single canary run.

Signed-off-by: Kyle Hounslow <kylehounslow@gmail.com>
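The span shapes described in this commit can be sketched roughly as follows. This is an illustration only, not the canary's actual code: `build_agent_trace`, the agent name, the tool name, and the token ranges are hypothetical, and the provider attribute is left out because its exact key is not stated here. Spans are plain dictionaries so the sketch runs without the OpenTelemetry SDK.

```python
import random

# Models listed in the commit message.
MODELS = ["gpt-4o", "claude-sonnet-4-20250514", "nova-pro"]


def build_agent_trace(rng, agent_name="demo-agent", tool_name="get_weather"):
    """Return one agent trace as a list of span dicts (hypothetical shape).

    agent_name, tool_name, and the token ranges are made up for illustration.
    """
    model = rng.choice(MODELS)  # randomized model per trace
    return [
        {
            "name": f"invoke_agent {agent_name}",
            "attributes": {
                "gen_ai.operation.name": "invoke_agent",
                "gen_ai.agent.name": agent_name,
            },
        },
        {
            "name": f"chat {model}",
            "attributes": {
                "gen_ai.operation.name": "chat",
                "gen_ai.request.model": model,
                "gen_ai.usage.input_tokens": rng.randint(50, 500),
                "gen_ai.usage.output_tokens": rng.randint(10, 200),
            },
        },
        {
            "name": f"execute_tool {tool_name}",
            "attributes": {
                "gen_ai.operation.name": "execute_tool",
                "gen_ai.tool.name": tool_name,
            },
        },
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    for span in build_agent_trace(rng):
        print(span["name"], span["attributes"])
```

A real canary would hand these attributes to an OTLP exporter; the dictionaries here only show which attribute keys go on which span type.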
Docker-compose .env uses opensearchstaging/opensearch:3.6.0 and opensearchstaging/opensearch-dashboards:3.6.0. Helm chart was using the official 3.5.0 images which are significantly behind. Signed-off-by: Kyle Hounslow <kylehounslow@gmail.com>
Missing explore, agentTraces, discoverTraces, discoverMetrics, query enhancements, new home page, and experimental features. Config now matches docker-compose opensearch_dashboards.yml. Plugins now loading: explore, agentTraces, observabilityDashboards, queryEnhancements, datasetManagement (54 total). Signed-off-by: Kyle Hounslow <kylehounslow@gmail.com>
1:1 parity with docker-compose.examples.yml:
- example-weather-agent (FastAPI + OTel instrumented)
- example-events-agent
- example-travel-planner (orchestrator)
- example-mcp-server (mock tool server)
- example-canary (periodic invocations with fault injection)

All services, env vars, ports, and memory limits match compose. Images built locally and loaded into kind via finch save/load.

Validated: all 5 agents running, canary invoking travel-planner with fault injection, traces flowing to OpenSearch.

Signed-off-by: Kyle Hounslow <kylehounslow@gmail.com>
- Gateway + HTTPRoute templates (replaces legacy Ingress)
- Two supported providers: envoy (Envoy Gateway), aws (VPC Lattice)
- Envoy: TLS via K8s secret (cert-manager or manual)
- AWS: TLS via ACM certificate annotation
- Disabled by default (gateway.enabled: false)
- Contributors can add GCP/Azure support

Signed-off-by: Kyle Hounslow <kylehounslow@gmail.com>
Signed-off-by: Kyle Hounslow <kylhouns@amazon.com>
Signed-off-by: Kyle Hounslow <kylehounslow@gmail.com> Signed-off-by: Kyle Hounslow <kylhouns@amazon.com>
Init job was hardcoding 'opensearch:9200' but the actual service is 'opensearch-cluster-master:9200'. Pass OPENSEARCH_ENDPOINT env var from the job template.

Signed-off-by: Kyle Hounslow <kylehounslow@gmail.com>
Signed-off-by: Kyle Hounslow <kylhouns@amazon.com>
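A minimal sketch of how a script could pick up the endpoint instead of hardcoding it, assuming `OPENSEARCH_ENDPOINT` carries a `host:port` value. The helper name `resolve_opensearch_url` and the `https` scheme are assumptions for illustration and are not taken from the real init script.

```python
import os


def resolve_opensearch_url(env=None, scheme="https"):
    """Build the OpenSearch base URL from OPENSEARCH_ENDPOINT (host:port).

    Hypothetical helper. The fallback is the service name from the commit
    message; the scheme is an assumption.
    """
    env = os.environ if env is None else env
    endpoint = env.get("OPENSEARCH_ENDPOINT", "opensearch-cluster-master:9200")
    return f"{scheme}://{endpoint}"


if __name__ == "__main__":
    # Prints the resolved URL for whatever environment the script runs in.
    print(resolve_opensearch_url())
```

Passing the environment as a parameter keeps the helper easy to test without mutating `os.environ`.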
Signed-off-by: Kyle Hounslow <kylehounslow@gmail.com> Signed-off-by: Kyle Hounslow <kylhouns@amazon.com>
- Add saved-queries-traces.yaml and saved-queries-metrics.yaml to chart
- Add architecture.png as binaryData in ConfigMap
- Mount all files to /config so init script can find them
- Update overview dashboard on every run (not skip if exists)
- 20 saved queries now load, architecture image embedded in dashboard

Signed-off-by: Kyle Hounslow <kylehounslow@gmail.com>
Signed-off-by: Kyle Hounslow <kylhouns@amazon.com>
… kube-state-metrics
…quest, OTel CPU, spans dropped, Prometheus query latency)
…eline + OpenSearch health dashboards)
…of truth) to helm, preserve K8s dashboard call
…ect#107 (adds Data Prepper panels, fixes prometheus panels)
- 29 tests across 6 suites covering all custom templates
- credentials, examples, gateway, init-dashboards, opensearch-exporter
- Tests conditional rendering, custom values, labels, annotations
- Wrapper script runs helm lint + helm unittest
- Requires helm-unittest plugin
Runs on push/PR to main when charts/ or test/helm-test.sh change.
Matches upstream canary changes (shallow/normal/deep trace shapes).
Port anonymous auth from docker-compose to Helm/K8s deployment. Closes #5.

Changes:
- Add anonymousAuth.enabled toggle in values.yaml (default: false)
- Create opensearch-security-config Secret with config.yml, roles.yml, roles_mapping.yml (anonymous_auth_enabled templated from values)
- Update OpenSearch Dashboards config with anonymous_auth_enabled and conditional savedObjects.permission.enabled via global values + tpl
- Sync init script with docker-compose version (ANONYMOUS_AUTH_ENABLED env var, conditional anonymous role in workspace allowedRoles)
- Pass OPENSEARCH_ANONYMOUS_AUTH_ENABLED env var to init-dashboards Job
- Wire up Terraform anonymous_auth variable to Helm release
- Add 6 helm-unittest tests covering both enabled/disabled states
- Document usage in chart README

Usage:

    helm install obs-stack charts/observability-stack \
      --set anonymousAuth.enabled=true \
      --set global.anonymousAuth.enabled=true

Kiro/claude on behalf of @kylehounslow
…adation
- Fixed k6 auth (manual Base64 header) and PPL query syntax
- Test 002: 300 VUs, 0% errors, p95=16ms, no stress
- Test 003: 1500 VUs, 0% errors, p95=2.28s, saturated
- Breaking point estimated between 500-700 VUs for good UX
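For reference, a manually built Basic auth header looks like the sketch below. The k6 scripts themselves are JavaScript; this Python version only illustrates the encoding step, and `basic_auth_header` with its placeholder credentials is a hypothetical helper, not code from this PR.

```python
import base64


def basic_auth_header(username, password):
    """Return an HTTP Basic Authorization header built by hand.

    The value is "Basic " followed by base64("username:password").
    """
    raw = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return {"Authorization": f"Basic {token}"}


if __name__ == "__main__":
    # Placeholder credentials, for illustration only.
    print(basic_auth_header("admin", "example-password"))
```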
… panels
- New saved-queries-self-monitoring.yaml: thread pool rejections, search latency, Prometheus query latency P99, Data Prepper buffer capacity, OTel dropped spans
- OpenSearch Health dashboard: added thread pool rejections, active searches, search latency, fetch rate panels
- Pipeline Health dashboard: added Prometheus range query latency P99 panel
- init-dashboards-configmap: include new saved queries file
…rialization error
- Terraform module spins up m5.xlarge in same VPC as EKS cluster
- k6 scripts hit OSD through ALB (real user path: TLS + WAF + ingress)
- api-queries-alb.js: PPL, search, PromQL, dashboard loads, service map
- run-remote.sh: upload scripts + run tests on EC2
- Previous tests via port-forward were bottlenecked by kubectl tunnel
…der 1000 VUs via ALB
…t 99% CPU is next
- OSD scaled 1×100m → 3×2CPU: median latency 3s → 824ms, 0% errors
- OpenSearch now the bottleneck: 99-100% CPU, search queue peak 34
- Hot threads: write/refresh contention from OTel Demo indexing
- p95=14.57s at 1000 VUs, need to scale OpenSearch next
…g applied
- OSD scaled to 3 replicas, 2 CPU / 2Gi (resolved OSD bottleneck)
- Documented 3 OpenSearch scaling options: horizontal, dedicated search nodes, vertical
- Official approach: separate index/search with remote store + search replicas
- Recommended: start with 3 data nodes (Option A), simplest path
- singleNode: false, replicas: 3 - JVM heap: 1g → 2g (50% of 4Gi RAM) - CPU: 500m req / 2000m limit - EKS scaled to 4 nodes to fit the cluster
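The heap sizing in this commit follows the common rule of giving the JVM about half of the container's memory. A small sketch of that arithmetic, assuming the result is passed as JVM options; the helper name `heap_opts_for_limit` is hypothetical, and how the flags reach the OpenSearch subchart is not shown.

```python
def heap_opts_for_limit(limit_gib, fraction=0.5):
    """Return -Xms/-Xmx flags sized to a fraction of the memory limit.

    Works in MiB so fractional GiB limits still yield a whole number.
    """
    heap_mib = int(limit_gib * 1024 * fraction)
    return f"-Xms{heap_mib}m -Xmx{heap_mib}m"


if __name__ == "__main__":
    # 4Gi limit at 50% gives a 2048 MiB (2g) heap.
    print(heap_opts_for_limit(4))
```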
…4.5s), uneven shard distribution
…and scaling recommendations
- Estimated concurrent user capacity by experience tier
- 7-day and 30-day data volume projections
- Scaling recommendations by user count with cost estimates
- Load test history summary with key findings
- Tracks what hasn't been tested yet
…lity
- Exact commands for uploading scripts, running tests, monitoring, retrieving results
- Current deployment state and access points
- k6 script details and known issues
- Key learnings and gotchas discovered during testing
- File structure and next steps
…s (was 143)
- number_of_replicas=2 gives every node a copy of every shard
- Node-2 went from 4k to 53k queries (12.6x improvement)
- 62% throughput improvement over single-node baseline
- Remaining bottleneck: primary shard routing preference on Node-0
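The replica arithmetic behind the first bullet: each shard has one primary plus `number_of_replicas` replica copies, and OpenSearch does not place two copies of the same shard on one node, so with 3 data nodes and 2 replicas every node ends up holding one copy. A sketch under those assumptions; the function names are hypothetical, and the settings body is the shape accepted by the index `_settings` API.

```python
def copies_per_shard(number_of_replicas):
    """Total copies of a shard: one primary plus its replicas."""
    return 1 + number_of_replicas


def every_node_has_copy(data_nodes, number_of_replicas):
    """True when there are at least as many shard copies as data nodes."""
    return copies_per_shard(number_of_replicas) >= data_nodes


def replica_settings_body(number_of_replicas):
    """Request body for updating the replica count on an existing index."""
    return {"index": {"number_of_replicas": number_of_replicas}}


if __name__ == "__main__":
    print(every_node_has_copy(data_nodes=3, number_of_replicas=2))
    print(replica_settings_body(2))
```

With only 1 replica on 3 nodes, each shard would be absent from one node, which is one reason query load can land unevenly.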
feat: Add anonymous authentication support to Helm chart
…rsistent config)
- Ingress: HTTPS/443 with ACM cert, TLS 1.3, external-dns hostname
- Health check: /app/login (unauthenticated, returns 200)
- OSD: replicaCount (not replicas), the correct subchart key
- OpenSearch: 4 CPU limit, 2 CPU request, 4Gi RAM, 2Gi JVM
- Prometheus: 50Gi PV, 2Gi/4Gi memory, 500m/1000m CPU
- OTel Demo: enabled in values.yaml
- preference=_replica in k6 search queries

Lesson: never use helm --reset-values or --set for config that should persist
…s/demo for clean deploy
Summary

Adds charts/observability-stack/, a Helm umbrella chart that deploys the full observability stack on Kubernetes, mirroring the existing docker-compose setup.

Components (all upstream charts as dependencies)

What's included

Validated on
- curl → OTel Collector → Data Prepper → OpenSearch ✅

Issues discovered and fixed during development
- OPENSEARCH_INITIAL_ADMIN_PASSWORD env var
- otlp source plugin
- ssl: false + peer_forwarder.ssl: false
- opensearch-cluster-master (not {release}-opensearch)
- opensearch.username / opensearch.password explicitly
- experimental.enabled_plugins in DP config

Follow-up items
- sgguruda62324/opensearch-data-prepper:2.15.0-SNAPSHOT custom build.
- values-staging.yaml for opensearchstaging bleeding-edge images